2 Data Mining Version Histories

Authors

  • Thomas Zimmermann
  • Andreas Zeller
Abstract

Program analysis has long been understood as the analysis of source code alone. A modern software product, though, is more than just program code; it contains documentation, interface descriptions, and resource data, all of which must be maintained and organized. In this paper, we propose a novel approach to maintaining such non-program entities: by learning from the development history of the product, we can determine coupling between entities, for example “Programmers who changed ComparePreferencePage.java typically also changed plugin.properties”. As a first proof of concept, our ROSE plug-in for ECLIPSE automatically guides the programmer along related changes.

2.1 Learning from History

Shopping for a book at Amazon.com, you may have come across a section that reads “Customers who bought this book also bought...”, listing other books that were typically included in the same purchase. Such information is gathered by data mining: the automated extraction of hidden predictive information from large data sets. We have applied such data mining to the version histories of large open-source software systems. This results in rules like the following:

  • Coupling between entities: “Programmers who changed the fkeys[] field always also changed the initDefaults() function”. The initDefaults() function initializes new elements of the fkeys[] field; whenever fkeys[] was extended by a new element, initDefaults() was extended by a statement that initialized the element.
  • Coupling between programs and documentation: “In 8 out of 10 cases, programmers who changed the embedded SQL statement in line 47 of status.py changed the JPEG image igordb.jpg”. The JPEG image is part of the product documentation and is a view of the database schema; whenever the schema changed, the SQL statements were changed, too, and the documentation was updated.

Such rules can reveal invariants of the development process (such as updating documentation); they can reveal factual coupling through common changes; and they can be put to practical use for programmers. Figure 2.1 shows our ROSE plug-in for the ECLIPSE programming environment, actually working on the ECLIPSE source code: as soon as the programmer makes a change to fkeys[], ROSE suggests further related changes, such as the change to initDefaults() described above.
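
The idea of deriving “programmers who changed X also changed Y” rules from a version history can be illustrated with a minimal sketch. It is not the ROSE implementation, which this excerpt does not describe; it simply counts how often two entities were changed in the same transaction. The function name `mine_cochange_rules`, the thresholds `min_support` and `min_confidence`, and the sample `history` are all assumptions made for illustration. The confidence of a rule X → Y is taken here as the fraction of transactions changing X that also changed Y, so “8 out of 10 cases” would correspond to a confidence of 0.8.

```python
from collections import Counter
from itertools import permutations


def mine_cochange_rules(transactions, min_support=2, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) tuples.

    transactions: iterable of sets of entity names changed together.
    support:      number of transactions that changed both entities.
    confidence:   support divided by the number of transactions
                  that changed the antecedent.
    """
    entity_count = Counter()  # how often each entity was changed
    pair_count = Counter()    # how often an ordered pair was changed together

    for changed in transactions:
        entities = set(changed)
        entity_count.update(entities)
        # Count both directions, since X -> Y and Y -> X are different rules.
        pair_count.update(permutations(sorted(entities), 2))

    rules = []
    for (antecedent, consequent), support in pair_count.items():
        confidence = support / entity_count[antecedent]
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))

    # Strongest rules first; ties broken by name for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical history: each set is one commit (transaction).
history = [
    {"fkeys[]", "initDefaults()"},
    {"fkeys[]", "initDefaults()", "plugin.properties"},
    {"fkeys[]", "initDefaults()"},
    {"initDefaults()"},
]

rules = mine_cochange_rules(history)
for antecedent, consequent, support, confidence in rules:
    print(f"changed {antecedent} => also changed {consequent} "
          f"(support {support}, confidence {confidence:.2f})")
```

In this made-up history, every commit that touched fkeys[] also touched initDefaults(), so that rule is reported with confidence 1.0. The reverse rule reaches only 3 out of 4 and falls below the assumed threshold, which shows why the direction of a rule matters.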

Publication date: 2004